A Corpus-Based Statistics-Oriented Transfer and Generation Model for Machine Translation
Abstract
In this paper, a corpus-based approach for acquiring transfer rules and selecting the most preferred transfer between a language pair is proposed. A transfer score is defined to measure the preference of different mappings between the source-target sentence pair, and a generation score is defined to provide a probabilistic mechanism for finding the most preferred generation pattern. An algorithm is proposed to find the appropriate transfer units within a syntax tree and the corresponding transfer rules. By applying such an algorithm, the preferred generation rules can be learned directly from the target language, so that the generation of target sentences can be tuned to follow the grammar and style of the target language instead of being bound by the analysis grammar of the source language.

1. Overview of Transfer Models

In a transfer-based machine translation system, a source sentence is analyzed into an intermediate representation for the source language. The intermediate representation is then transferred into its target equivalent. Finally, the target surface strings are generated according to the intermediate representation of the target sentence. More specifically, the major tasks for transfer and generation include (1) reducing the analysis result into an intermediate form that is suitable for transfer, (2) selecting appropriate target words for source words, (3) making an appropriate mapping from the source form to the target structure, and (4) generating the target equivalent from the target representation. (The word 'transfer' will sometimes be used ambiguously to refer to both transfer and generation.)

1.1. Rule-Based Approaches

A common transfer approach is to carry out a sequence of tree-to-tree mappings, at either the syntactic or the semantic level, using a set of source-target transfer pattern-action pairs to reflect the changes in substructures and linear order in the language pair [Benn 85, Naga 85, Tsut 90].
The major problems with such an approach are the coverage of the transfer rules (or patterns) and the consistency among the rules in the context of a wide variety of application domains. It is hard and costly to acquire a complete and consistent set of transfer rules manually. Notably, it is nontrivial to identify the appropriate atomic units for transfer. The large set of transfer rules thus imposes nontrivial acquisition and maintenance problems as the system scales up. Additionally, in most transfer-based systems that follow a "one-way" analysis-transfer-generation process, the generated sentences are often strongly bound to the analysis grammar, since the generation rules are influenced greatly by the source language. The generation grammar might preserve so many stylistic characteristics of the source language that the generated sentences are unnatural to native speakers [Su 93]. Our experiences with the BehaviorTran (formerly the ArchTran) MTS [Chen 91] show that such translation quality is still far below the user
Similar resources
Why Corpus-Based Statistics-Oriented Machine Translation
Rule-based approaches have been the dominant paradigm in developing MT systems. Such approaches, however, suffer from difficulties in knowledge acquisition to meet the wide variety and time-changing characteristics of real text. To attack this problem, some statistical translation models and supporting tools have been developed in the last few years. However, a simple statistical model often...
Handling Translation Divergences in Generation-Heavy Hybrid Machine Translation
This paper describes a novel approach for handling translation divergences in a Generation-Heavy Hybrid Machine Translation (GHMT) system. The approach depends on the existence of rich target language resources such as word lexical semantics, including information about categorial variations and subcategorization frames. These resources are used to generate multiple structural variations from ...
Treebanks in Machine Translation
We present an approach using treebanks in machine translation. Our experiment in Czech-English machine translation is an attempt to develop a full machine translation system based on dependency trees (Dependency Based Machine Translation, DBMT). We use the following resources: Prague Dependency Treebank, a newly created Czech-English parallel corpus of Penn Treebank, English monolingual corpus,...
A Direction of MT Development
The two most recently popular technological paradigms in machine translation, example-based translation (EBMT) and statistics-based translation (SBMT), require knowledge about language only as an afterthought. While the representatives of the above paradigms are still at the stage of either building toy systems (e.g., Furuse and Iida, 1992; McLean, 1992, Jones, 1992, Maruyama and Watanabe, 199...
Corpus-Based Statistics-Oriented (CBSO) Machine Translation Researches in Taiwan
A brief introduction to the MT research projects in Taiwan is given in this paper. Special attention is given to the increasingly popular corpus-based statistics-oriented (CBSO) approaches in MT research. In particular, the parameterized two-way training philosophy in designing the second generation BehaviorTran, which is the first and the largest operational system in this area, is introduc...